<!--begin.rcode results='hide', echo=FALSE, message=FALSE
library(caret)
library(pROC)
library(DMwR)
library(ROSE)
library(randomForest)

library(parallel)
library(doMC)
registerDoMC(cores=detectCores()-1)

theme_set(theme_bw())

hook_inline = knit_hooks$get('inline')
knit_hooks$set(inline = function(x) {
  if (is.character(x)) highr::hi_html(x) else hook_inline(x)
  })
opts_chunk$set(comment=NA)

session <- paste(format(Sys.time(), "%a %b %d %Y"),
                 "using caret version",
                 packageDescription("caret")$Version,
                 "and",
                 R.Version()$version.string)
    end.rcode-->

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
  <!--
  Design by Free CSS Templates
http://www.freecsstemplates.org
Released for free under a Creative Commons Attribution 2.5 License

Name       : Emerald 
Description: A two-column, fixed-width design with dark color scheme.
Version    : 1.0
Released   : 20120902

-->
  <html xmlns="http://www.w3.org/1999/xhtml">
  <head>
  <meta name="keywords" content="" />
  <meta name="description" content="" />
  <meta http-equiv="content-type" content="text/html; charset=utf-8" />
  <title>Subsampling For Class Imbalances</title>
  <link href='http://fonts.googleapis.com/css?family=Abel' rel='stylesheet' type='text/css'>
  <link href="style.css" rel="stylesheet" type="text/css" media="screen" />
  </head>
  <body>
  <div id="wrapper">
  <div id="header-wrapper" class="container">
  <div id="header" class="container">
  <div id="logo">
  <h1><a href="#">Subsampling For Class Imbalances</a></h1>
</div>

  </div>
  <div><img src="images/img03.png" width="1000" height="40" alt="" /></div>
  </div>
  <!-- end #header -->
<div id="page">
  <div id="content">
  
<h1>Contents</h1>  
<ul>
  <li><a href="#methods">Subsampling Techniques</a></li>
  <li><a href="#resampling">Subsampling During Resampling</a></li>
  <li><a href="#complications">Complications</a></li>
  <li><a href="#custom">Using Custom Subsampling Techniques</a></li>  
</ul>   
      

<p>
In classification problems, a disparity in the frequencies of the observed classes can have a significant negative impact on model fitting. One technique for resolving such a class imbalance is to subsample the training data in a manner that mitigates the issues. Examples of sampling methods for this purpose are:
</p>
<ul>
  <li>
  	<i>down-sampling</i>:  randomly subset all the classes in the training set so that their class frequencies match the least prevalent class. For example, suppose that 80% of the training set samples are the first class and the remaining 20% are in the second class. Down-sampling would randomly sample the first class to be the same size as the second class (so that only 40% of the total training set is used to fit the model). <strong>caret</strong> contains a function (<span class="mx funCall">downSample</span>) to do this.
  </li>
  <li>
  	<i>up-sampling</i>: randomly sample (with replacement) the minority class to be the same size as the majority class. <strong>caret</strong> contains a function (<span class="mx funCall">upSample</span>) to do this. 
  </li>
  <li>
  	<i>hybrid methods</i>: techniques such as <a href="https://scholar.google.com/scholar?hl=en&q=SMOTE&btnG=&as_sdt=1%2C33&as_sdtp=">SMOTE</a> and <a href="https://scholar.google.com/scholar?q=%22Training+and+assessing+classification+rules+with+imbalanced+data%22&btnG=&hl=en&as_sdt=0%2C33">ROSE</a> down-sample the majority class and synthesize new data points in the minority class.  There are two packages (<strong>DMwR</strong> and <strong>ROSE</strong>) that implement these procedures. 
  </li> 
</ul>  
<p>
Note that this type of sampling is different from splitting the data into a training and test set. You would never want to artificially balance the test set; its class frequencies should be in-line with what one would see "in the wild". Also, the above procedures are independent of resampling methods such as cross-validation and the bootstrap. 
</p>
<p>
In practice, one could take the training set and, before model fitting, sample the data. There are two issues with this approach
</p>
<ul>
  <li>
  	Firstly, during model tuning the holdout samples generated during resampling are also glanced and may not reflect the class imbalance that future predictions would encounter. This is likely to lead to overly optimistic estimates of performance.
  </li>
  <li>
  	Secondly, the subsampling process will probably induce more model uncertainty. Would the model results differ under a different subsample? As above, the resampling statistics are more likely to make  the model appear more effective than it actually is.  
  </li> 
</ul> 
<p>
The alternative is to include the subsampling inside of the usual resampling procedure. This is also advocated for pre-process and feature selection steps too. The two disadvantages are that it might increase computational times and that it might also complicate the analysis in other ways (see the <a href="#complications">section below</a> about the pitfalls). 
</p>

<div id="methods"></div>        
<h1>Subsampling Techniques</h1>

<!--begin.rcode samp_imbal_data0, results='hide', echo=FALSE, message=FALSE
set.seed(2969)
imbal_train <- twoClassSim(10000, intercept = -20, linearVars = 20)
imbal_test  <- twoClassSim(10000, intercept = -20, linearVars = 20)
    end.rcode-->
<br>      
<p>
To illustrate these methods, let's simulate some data with a class imbalance using this method. We will simulate a training and test set where each contains <!--rinline I(nrow(imbal_train)) --> samples and a minority class rate of about <!--rinline I(round(mean(imbal_train$Class == "Class2")*100, 1)) -->%:
</p>
<!--begin.rcode samp_imbal_data, echo=TRUE, message=FALSE, cache = TRUE
library(caret)

set.seed(2969)
imbal_train <- twoClassSim(10000, intercept = -20, linearVars = 20)
imbal_test  <- twoClassSim(10000, intercept = -20, linearVars = 20)
table(imbal_train$Class)
    end.rcode-->
<br>  
<p>
Let's create different versions of the training set prior to model tuning:
</p>
<!--begin.rcode subsample, echo=TRUE, message=FALSE, cache = TRUE
set.seed(9560)
down_train <- downSample(x = imbal_train[, -ncol(imbal_train)],
                         y = imbal_train$Class)
table(down_train$Class)   

set.seed(9560)
up_train <- upSample(x = imbal_train[, -ncol(imbal_train)],
                     y = imbal_train$Class)                         
table(up_train$Class) 

library(DMwR)

set.seed(9560)
smote_train <- SMOTE(Class ~ ., data  = imbal_train)                         
table(smote_train$Class) 

library(ROSE)

set.seed(9560)
rose_train <- ROSE(Class ~ ., data  = imbal_train)$data                         
table(rose_train$Class) 
    end.rcode-->
<br>    
<p>
For these data, we'll use a bagged classification and estimate the area under the ROC curve using five repeats of 10-fold CV.
</p>
<!--begin.rcode outside_models, echo=TRUE, message=FALSE, cache = TRUE
ctrl <- trainControl(method = "repeatedcv", repeats = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

set.seed(5627)
orig_fit <- train(Class ~ ., data = imbal_train, 
                  method = "treebag",
                  nbagg = 50,
                  metric = "ROC",
                  trControl = ctrl)

set.seed(5627)
down_outside <- train(Class ~ ., data = down_train, 
                      method = "treebag",
                      nbagg = 50,
                      metric = "ROC",
                      trControl = ctrl)

set.seed(5627)
up_outside <- train(Class ~ ., data = up_train, 
                    method = "treebag",
                    nbagg = 50,
                    metric = "ROC",
                    trControl = ctrl)

set.seed(5627)
rose_outside <- train(Class ~ ., data = rose_train, 
                      method = "treebag",
                      nbagg = 50,
                      metric = "ROC",
                      trControl = ctrl)


set.seed(5627)
smote_outside <- train(Class ~ ., data = smote_train, 
                       method = "treebag",
                       nbagg = 50,
                       metric = "ROC",
                       trControl = ctrl)
    end.rcode-->
<br>  
<p>
We will collate the resampling results and create a wrapper to estimate the test set performance:
</p>
<!--begin.rcode outside_results, echo=TRUE, message=FALSE, cache = TRUE
outside_models <- list(original = orig_fit,
                       down = down_outside,
                       up = up_outside,
                       SMOTE = smote_outside,
                       ROSE = rose_outside)

outside_resampling <- resamples(outside_models)

test_roc <- function(model, data) {
  library(pROC)
  roc_obj <- roc(data$Class, 
                 predict(model, data, type = "prob")[, "Class1"],
                 levels = c("Class2", "Class1"))
  ci(roc_obj)
  }

outside_test <- lapply(outside_models, test_roc, data = imbal_test)
outside_test <- lapply(outside_test, as.vector)
outside_test <- do.call("rbind", outside_test)
colnames(outside_test) <- c("lower", "ROC", "upper")
outside_test <- as.data.frame(outside_test)

summary(outside_resampling, metric = "ROC")
outside_test
    end.rcode-->
<br>  
<p>
The training and test set estimates for the area under the ROC curve do not appear to correlate. Based on the resampling results, one would infer that up-sampling is nearly perfect and that ROSE does relatively poorly. The reason that up-sampling appears to perform so well is that the samples in the majority class are replicated and have a large potential to be in both the model building and hold-out sets. In essence, the hold-outs here are not truly independent samples.  
</p>
<p>
In reality, all of the sampling methods do about the same (based on the test set). The statistics for the basic model fit with no sampling are fairly in-line with one another (<!--rinline I(round(getTrainPerf(orig_fit)[, "TrainROC"], 3)) --> via resampling and <!--rinline I(round(outside_test["original", "ROC"], 3))--> for the test set). 
</p>

<div id="resampling"></div>        
<h1>Subsampling During Resampling</h1>

<p>
Recent versions of <strong>caret</strong> allow the user to specify subsampling when using <span class="mx funCall">train</span> so that it is conducted inside of resampling. All four methods shown above can be accessed with the basic package using simple syntax. If you want to use your own technique, or want to change some of the parameters for <span class="mx funCall">SMOTE</span> or <span class="mx funCall">ROSE</span>, the last section below shows how to use custom subsampling. 
</p>
<p>
The way to enable subsampling is to use yet another option in <span class="mx funCall">trainControl</span> called <span class="mx arg">sampling</span>. The most basic syntax is to use a character string with the name of the sampling method, either <tt><span class="mx str">"down"</span></tt>, <tt><span class="mx str">"up"</span></tt>, <tt><span class="mx str">"smote"</span></tt>, or <tt><span class="mx str">"rose"</span></tt>. Note that you will need to have the <strong>DMwR</strong> and <strong>ROSE</strong> packages installed to use SMOTE and ROSE, respectively. 
</p>
<p>
One complication is related to pre-processing. Should the subsampling occur before or after the pre-processing? For example, if you down-sample the data and using PCA for signal extraction, should the loadings be estimated from the entire training set? The estimate is potentially better since the entire training set is being used but the subsample may happen to capture a small potion of the PCA space. There isn't any obvious answer.  
</p>
<p>
The default behavior is to subsample the data prior to pre-processing. This can be easily changed and an example is given below. 
</p>
<p>
Now let's re-run our bagged tree models while sampling inside of cross-validation:
</p>
<!--begin.rcode inside_models, echo=TRUE, message=FALSE, cache = TRUE
ctrl <- trainControl(method = "repeatedcv", repeats = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     ## new option here:
                     sampling = "down")

set.seed(5627)
down_inside <- train(Class ~ ., data = imbal_train, 
                     method = "treebag",
                     nbagg = 50,
                     metric = "ROC",
                     trControl = ctrl)

## now just change that option
ctrl$sampling <- "up"

set.seed(5627)
up_inside <- train(Class ~ ., data = imbal_train, 
                   method = "treebag",
                   nbagg = 50,
                   metric = "ROC",
                   trControl = ctrl)

ctrl$sampling <- "rose"

set.seed(5627)
rose_inside <- train(Class ~ ., data = imbal_train, 
                     method = "treebag",
                     nbagg = 50,
                     metric = "ROC",
                     trControl = ctrl)

ctrl$sampling <- "smote"

set.seed(5627)
smote_inside <- train(Class ~ ., data = imbal_train, 
                      method = "treebag",
                      nbagg = 50,
                      metric = "ROC",
                      trControl = ctrl)
    end.rcode-->
<br>  
<p>
Here are the resampling and test set results:
</p>
<!--begin.rcode inside_results, echo=TRUE, message=FALSE, cache = TRUE
inside_models <- list(original = orig_fit,
                      down = down_inside,
                      up = up_inside,
                      SMOTE = smote_inside,
                      ROSE = rose_inside)

inside_resampling <- resamples(inside_models)

inside_test <- lapply(inside_models, test_roc, data = imbal_test)
inside_test <- lapply(inside_test, as.vector)
inside_test <- do.call("rbind", inside_test)
colnames(inside_test) <- c("lower", "ROC", "upper")
inside_test <- as.data.frame(inside_test)

summary(inside_resampling, metric = "ROC")
inside_test
    end.rcode-->

<!--begin.rcode compare_results, echo=FALSE, message=FALSE, cache = TRUE
inside_roc <- inside_resampling$values[, grepl("ROC", names(inside_resampling$values))]
inside_means <- apply(inside_roc, 2, mean)
names(inside_means) <- gsub("~ROC", "", names(inside_means))
inside_means <- data.frame(model = names(inside_means),
                           CV = as.vector(inside_means))
inside_means$When <- "Inside"
inside_test$model <- rownames(inside_test)

inside_results <- merge(inside_test[, c("model", "ROC")], inside_means)

outside_roc <- outside_resampling$values[, grepl("ROC", names(outside_resampling$values))]
outside_means <- apply(outside_roc, 2, mean)
names(outside_means) <- gsub("~ROC", "", names(outside_means))
outside_means <- data.frame(model = names(outside_means),
                            CV = as.vector(outside_means))
outside_means$When <- "Outside"
outside_test$model <- rownames(outside_test)

outside_results <- merge(outside_test[, c("model", "ROC")], outside_means)
outside_results$When[outside_results$model == "original"] <- "No Sampling"
plot_data <- rbind(inside_results, outside_results)
plot_data$Diff <- abs(plot_data$ROC - plot_data$CV)

rng <- extendrange(c(plot_data$CV, plot_data$ROC))
end.rcode-->
<br>  
<p>
The figure below shows the difference in the area under the ROC curve and the test set results for the approaches shown here. Repeating the subsampling procedures for every resample produces results that are more consistent with the test set. 
</p>


<!--begin.rcode compare, tidy=FALSE, fig.height=5, fig.width=7,echo = FALSE 
ggplot(plot_data, aes(x = model, y = Diff, color = When)) +
  geom_point() + 
  theme(legend.position = "top") + 
  ylab("Absolute Difference Between CV and Test Results") + xlab("")
    end.rcode-->  


<div id="complications"></div>        
<h1>Complications</h1>

<p>
The user should be aware that there are a few things that can happening when subsampling that can cause issues in their code. As previously mentioned, when sampling occurs in relation to pre-processing is one such issue. Others are:
</p>
<ul>
  <li>
  	Sparsely represented categories in factor variables may turn into zero-variance predictors or may be completely sampled out of the model. 
  </li>
  <li>
  	The underlying functions that do the sampling (e.g. <span class="mx funCall">SMOTE</span>, <span class="mx funCall">downSample</span>, etc) operate in very different ways and this can affect your results. For example, <span class="mx funCall">SMOTE</span> and <span class="mx funCall">ROSE</span> will convert your predictor input argument into a data frame (even if you start with a matrix). 
  </li>
  <li>
  	Currently, sample weights are not supported with sub-sampling. 
  </li> 
  <li>
  	If you use <span class="mx arg">tuneLength</span> to specify the search grid, understand that the data that is used to determine the grid has not been sampled. In most cases, this will not matter but if the grid creation process is affected by the sample size, you may end up using a sub-optimal tuning grid.  
  </li> 
  <li>
  	For some models that require more samples than parameters, a reduction in the sample size may prevent you from being able to fit the model. 
  </li>     
</ul>

<div id="custom"></div>        
<h1>Using Custom Subsampling Techniques</h1>

<p>
Users have the ability to create their own type of subsampling procedure. To do this, alternative syntax is used with the <span class="mx arg">sampling</span> argument of the <span class="mx funCall">trainControl</span>. Previously, we used a simple string as the value of this argument. Another way to specify the argument is to use a list with three (named) elements:
</p>
<ul>
  <li>
  	The <span class="mx arg">name</span> value is a character string used when the <code>train</code> object is printed. It can be any string. 
  </li>
  <li>
  	The <span class="mx arg">func</span> element is a function that does the subsampling. It should have arguments called <span class="mx arg">x</span> and <span class="mx arg">y</span> that will contain the predictors and outcome data, respectively. The function should return a list with elements of the same name.  
  </li>
  <li>
  	The <span class="mx arg">first</span> element is a single logical value that indicates whether the subsampling should occur first relative to pre-process. A value of <tt><span class="mx num">FALSE</span></tt> means that the subsampling function will receive the sampled versions of <span class="mx arg">x</span> and <span class="mx arg">y</span>.
  </li> 
</ul>
<p>
For example, here is what the list version of the <span class="mx arg">sampling</span> argument looks like when simple down-sampling is used:
</p>
<!--begin.rcode example_list, message=FALSE, cache = TRUE
down_inside$control$sampling
end.rcode-->
<br>
<p>
As another example, suppose we want to use SMOTE but use 10 nearest neighbors instead of the default of 5. To do this, we can create a simple wrapper around the <span class="mx funCall">SMOTE</span> function and call this instead:
</p>
<!--begin.rcode smote_list, message=FALSE
smotest <- list(name = "SMOTE with more neighbors!",
                func = function (x, y) {
                  library(DMwR)
                  dat <- if (is.data.frame(x)) x else as.data.frame(x)
                  dat$.y <- y
                  dat <- SMOTE(.y ~ ., data = dat, k = 10)
                  list(x = dat[, !grepl(".y", colnames(dat), fixed = TRUE)], 
                       y = dat$.y)
                  },
                first = TRUE)
end.rcode-->
<br>
<p>
The control object would then be:
</p>
<!--begin.rcode smote_call, message=FALSE
ctrl <- trainControl(method = "repeatedcv", repeats = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     sampling = smotest)
end.rcode-->


<div style="clear: both;">&nbsp;</div>
  </div>
  <!-- end #content -->
<div id="sidebar">
<ul>
  <li>
    <h2>General Topics</h2>
    <ul>
      <li><a href="index.html">Front Page</a></li>
      <li><a href="visualizations.html">Visualizations</a></li>
      <li><a href="preprocess.html">Pre-Processing</a><li>
      <li><a href="splitting.html">Data Splitting</a></li>
      <li><a href="varimp.html">Variable Importance</a></li>
      <li><a href="other.html">Model Performance</a></li>
      <li><a href="parallel.html">Parallel Processing</a></li>
    </ul>
    <h2>Model Training and Tuning</h2>
    <ul>
      <li><a href="training.html">Basic Syntax</a></li>
      <li><a href="modelList.html">Sortable Model List</a></li>
      <li><a href="bytag.html">Models By Tag</a></li>
      <li><a href="similarity.html">Models By Similarity</a></li>
      <li><a href="custom_models.html">Using Custom Models</a></li>
      <li><a href="sampling.html">Sampling for Class Imbalances</a></li> 
      <li><a href="random.html">Random Search</a></li> 
      <li><a href="adaptive.html">Adaptive Resampling</a></li> 
    </ul>
    <h2>Feature Selection</h2>
    <ul>
      <li><a href="featureselection.html">Overview</a>
      <li><a href="rfe.html">RFE</a></li>
      <li><a href="filters.html">Filters</a></li>
      <li><a href="GA.html">GA</a></li>
      <li><a href="SA.html">SA</a></li>
    </ul>  
  </li>
</ul>
</div>
<!-- end #sidebar -->
<div style="clear: both;">&nbsp;</div>
  </div>
  <div class="container"><img src="images/img03.png" width="1000" height="40" alt="" /></div>
  <!-- end #page -->
</div>
  <div id="footer-content"></div>
  <div id="footer">
  <!--begin.rcode echo = FALSE
    knit_hooks$set(inline = hook_inline)    
    end.rcode-->   
  <p>Created on <!--rinline I(session) -->.</p>
  </div>
  <!-- end #footer -->
</body>
  </html>
